JPX Tokyo Stock Exchange Prediction

1 | Competition Overview

Success in any financial market requires one to identify solid investments. When a stock or derivative is undervalued, it makes sense to buy. If it's overvalued, perhaps it's time to sell. While these finance decisions were historically made manually by professionals, technology has ushered in new opportunities for retail investors. Data scientists, specifically, may be interested to explore quantitative trading, where decisions are executed programmatically based on predictions from trained models. In this competition, financial data for the Japanese market will be provided, allowing retail investors to analyze the market to the fullest extent.

Japan Exchange Group, Inc. (JPX) is a holding company operating one of the largest stock exchanges in the world, Tokyo Stock Exchange (TSE), and derivatives exchanges Osaka Exchange (OSE) and Tokyo Commodity Exchange (TOCOM). JPX is hosting this competition and is supported by AI technology company AlpacaJapan Co., Ltd.

This competition will involve building portfolios from the stocks eligible for predictions. Specifically, each participant ranks the stocks from highest to lowest expected returns and is evaluated on the difference in returns between the top and bottom 50 stocks. You'll have access to financial data from the Japanese market, such as stock information and historical stock prices to train and test your model. After the training phase is complete, the competition will compare your models against real future returns.

The training data begins in January 2017 with about 1,860 stocks with new stocks added through December 2020 for a total of 2,000 stocks. Below is a brief overview and descriptive statistics of the variables in the training data.

For Installing plotly and lightgbm model in juypter notebook following command are required:-

1)conda install -c https://conda.anaconda.org/plotly plotly

2)conda install -c conda-forge lightgbm

This below blocks will not be executed. this is alternative way that how to import dataset from kaggle itself.

Import Libraries:

2 | Exploratory Data Analysis

The graphs above show the market's average stock return, closing price, and shares traded since January 2017. While there has been much fluctuation over the past four years, the number of shares traded has slightly decreased from the volume in early 2017.

In 2021, nearly all industries saw a positive return on average, with the highest in Energy Resources at about 0.13% overall, while in 2018, all sectors saw a negative return except for Electric Power & Gas.

Since some of the stocks were added in December 2020, We will use the data filtered after this date so that the data will consist of 231 days of stock prices for all 2,000 stocks.

While most sectors have returns between 10% and -10%, there are quite a few outliers across all industries, with some returns as high as 62% in Commercial & Wholesale Trade and others as low as -31% in IT & Services sector. The graph below shows the stock price movements within each sector.

In the candlestick charts above, the boxes represent the daily spread between the open and close prices and the lines represent the spread between the low and high prices. The color of the boxes indicates whether the close price was greater or lower than the open price, with green indicating a higher closing price on that day and red indicating a lower closing price. In late August, the market saw a consecutive 14-day period where the close price was greater than the open price.

Among stocks with the highest return on average since December 2020, 6 of the 7 were in the IT & Services sector and one was in the Pharmaceutical sector. The IT & Services and Pharmaceutical sectors also make up 6 of the stocks with the lowest returns on average. The graph below shows the relationships between the top 5 stocks with the highest average returns, Enechange Ltd., For Startups, Inc., Symbio Pharmaceuticals, Fronteo, Inc., and Emnet Japan Co. Ltd.

3 | Feature Engineering

Before creating additional features, I will first generate the adjusted close price using the function provided by the competition hosts in the Train Demo Notebook, which will adjust the close price to account for any stock splits or reverse splits. Using the stock's adjusted close price, I will create a set of new features, including the stock price moving average, exponential moving average, return, and volatility, each over a period of 5, 10, 20, 30, and 50 days. These features are shown in the graphs below across each sector.

In the graphs of the stock price Moving Averages (MA) and Exponential Moving Averages (EMA), when the shorter period, the 10-Day average, crosses above the longer period, the 50-day average, the closing price tends to decrease, which is typically indicative of a sell signal. Conversely, when the 10-Day MA/EMA crosses the 50-Day MA/EMA from below, the closing price increases, which typically indicates a buy signal. In addition, the Exponential Moving Averages tend to respond to stock price changes more quickly as it puts greater weight on more recent observations. In the graphs of the Stock Return, there is greater fluctuation between longer periods, while Stock Volatility tends to be more stable over time, with more fluctuation in shorter periods.

4 | Stock Price Prediction

Based on the Sharpe Ratio of the daily spread of returns. For each forecast day, all active stocks will be ranked in order of their predicted return. The returns for a single day treat the 50 highest (e.g. 0 to 49) ranked stocks as purchased and the lowest (e.g. 1850 to 1999) ranked 50 stocks as shorted. The stocks are then weighted based on their ranks and the total returns for the portfolio are calculated assuming the stocks were purchased the next day and sold the day after that.

Since risk control is also an important element of investment, the competing score is the mean/standard deviation of the time series of daily spread returns, rather than the simple mean or sum of daily spread returns. This makes it necessary to build a model that can respond to changes in the distribution of data and produce stable and high performance, rather than a model that only wins big on certain days.

Refererence